From: Bill Mark [billmark@graphics.stanford.edu]
Sent: Thursday, October 11, 2001 7:16 PM
To: Bill Dally
Cc: Ian Buck; lance@leland.Stanford.EDU; ujk@leland.Stanford.EDU;
Billmark@lambert.stanford.edu; pmattson@leland.Stanford.EDU;
serebrin@stanford.edu; merez@stanford.edu; jowens@graphics.stanford.edu;
horowitz@chroma.stanford.edu; hanrahan@graphics.stanford.edu
Subject: Re: Brook - multiple input and output streams

On Thu, 11 Oct 2001, Bill Dally wrote:

> First, by the very one-dimensional nature of a stream, the use of
> persistant state is one dimensional - early elements affect later elements
> but not vice-versa.  Thus, the computation on persistant state can be
> converted to a log-depth parallel prefix computation - most of which takes
> place locally in one processing "element" - and only the top levels of the
> tree are between elements.

My understanding is that this property only holds if the computation
is associative.  If the programming model includes persistent kernels,
there is no guarantee that the computation will be associative.

So perhaps we should include some more restricted mechanism that makes
it clear that this type of computation must be associative.  For
example, we could include an explicit mechanism for requesting
parallel-prefix computations.  The reduction capability that Ian has
proposed is an even more restricted version of this.  Although the
reduction capability is less expressive than a general prefix
computation, it has the advantage that it does not place any
restrictions on the partitioning of stream elements across processors.

-Bill M.

>
> Key to managing this type of communication - which we need anyway for
> reductions - like the "max" in Ian's example - is setting the proper
> traversal order.  The traversal order wants to be in order of least
> expensive communication/synchronization.  Roughly speaking the order is -
> across "clusters" in a clustered processor (MIMD or SIMD), across time in
> one processor, across "nearby" processors (e.g., on a chip or board),
> across distant processors.  This order means that very little of the log
> tree structure crosses the expensive links.
>
> Whether a processor is SIMD or MIMD is immaterial to the cost of
> communication.  What really matters is how tightly synchornized they are -
> not that they are all doing the same instruction.  Processors that are
> synchronized can exchange some data directly - without buffering or
> waiting.  There are lots of good technologies to synchronize MIMD
> processors (read our M-Machine papers) - SIMD processors are inherently
> synchronized.
>
> ----Bill
>
> At 11:46 PM 10/10/2001 -0700, Bill Mark wrote:
> >On Wed, 10 Oct 2001, Bill Dally wrote:
> >
> > >...
> > > The conditional and unconditional << and >> operators in KernelC are pretty
> > > useful.  You should consider adopting this model in Brook - where the
> > > kernel is 'active' for the duration of the stream - rather than a single
> > > element - and can explicitly read from inputs and write to outputs.
> > > ...
> >
> >"Persistent kernels" are powerful, but allowing them does present
> >some challenges, especially from a programming-model point of view.
> >
> >What happens when you have more than one processing unit operating on
> >the stream?  There are duplicate copies of the persistent state,
> >possibly with different values.  On Imagine, you have to add explicit
> >communication between the processors to do something reasonable in
> >this situation.  My impression is that this communication code is
> >often designed for a particular number of processors.  It's not clear
> >to me that there's any simple way to hide the number of processors.
> >
> >If the language provided the "illusion" of one processor, the compiler
> >might be able to insert the appropriate communication, but I suspect
> >that it would then be very easy for the programmer to create code that
> >would run very slowly when the number of stream processors is high.
> >We should look at this question in more detail; my instinct may be
> >wrong here.
> >
> >Allowing communication between processing units also seems to require
> >that the processors be SIMD.  SIMD's ability to simply perform
> >communication of this type is perhaps its greatest advantage over
> >MIMD.
> >
> >Bill M.
> >
> >
> >
>
> --------------------------------------------------------------------------
> Bill Dally                billd@csl.stanford.edu             (650)725-8945
> Professor of Electrical Engineering and Computer Science  FAX(650)725-6949
> Computer Systems Laboratory, Stanford University
> Gates Room 301
> Stanford, CA  94305                         http://csl.stanford.edu/~billd
> --------------------------------------------------------------------------
>